{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding_rank_r_in_LoRA_and_related_Matrix_Math"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](assets/2024-01-08-20-38-59.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "💡 What is rank `r` and the concept of \"low-rank\" matrix factorization in the context of LoRA ?\n",
    "\n",
    "📌 The rank of a matrix in linear algebra measures the dimension of the vector space generated (or spanned) by its columns (or rows). In simpler terms, it tells us the maximum number of linearly independent column vectors (i.e. Column Rank) or row vectors (Row Rank) in the matrix.\n",
    "\n",
    "- It indicates how much information is packed into the matrix.\n",
    "    \n",
    "- For instance, if a matrix is used to represent a set of linear equations, its rank reveals the number of distinct equations.\n",
    "\n",
    "📌 A matrix is low-rank if it has many fewer linearly independent columns than columns. Such matrices can be efficiently represented using rank-factorizations\n",
    "\n",
    "------------\n",
    "\n",
    "📌 The purpose of low-rank factorization is to factorize the matrix into a product of two matrices with low dimensions.\n",
    "\n",
    "📌 The core idea behind LoRA is original weight matrix  W  is adapted by adding a low-rank product of two smaller matrices  BA , where  B  and  A  are the low-rank matrices. So, the adapted weight matrix becomes  W + BA.\n",
    "\n",
    "📌 So when finetuning with LoRA, the original weights  W  are frozen, and only the elements of the low-rank matrices  B  and  A  are trained, leading to a drastic reduction in the trainable parameter count.\n",
    "\n",
    "📌 In traditional fine-tuning, we modify a pre-trained neural network’s weights to adapt to a new task. This adjustment involves altering the original weight matrix ( W ) of the network. The changes made to ( W ) during fine-tuning are collectively represented by ( Δ W ), such that the updated weights can be expressed as ( W + Δ W ).\n",
    "\n",
    "![](assets/2024-01-29-00-28-27.png)\n",
    "\n",
    "Now, rather than modifying ( W ) directly, the LoRA approach seeks to decompose ( Δ W ). This decomposition is a crucial step in reducing the computational overhead associated with fine-tuning large models.\n",
    "\n",
    "📌 But the intrinsic rank hypothesis suggests that significant changes to the neural network can be captured using a lower-dimensional representation. Essentially, it posits that not all elements of ( Δ W ) are equally important; instead, a smaller subset of these changes can effectively encapsulate the necessary adjustments.\n",
    "\n",
    "📌 Building on this hypothesis, LoRA proposes representing ( Δ W ) as the product of two smaller matrices, ( A ) and ( B ), with a lower rank. The updated weight matrix ( W’ ) thus becomes:\n",
    "\n",
    "[ W’ = W + BA ]\n",
    "\n",
    "In this equation, ( W ) remains frozen (i.e., it is not updated during training). The matrices ( B ) and ( A ) are of lower dimensionality, with their product ( BA ) representing a low-rank approximation of ( Δ W ).\n",
    "\n",
    "\n",
    "\n",
    "![](assets/2024-01-29-00-30-05.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "-----------------\n",
    "\n",
    "💡 Example\n",
    "\n",
    "📌 Let's say our original matrix is `d × k`\n",
    "\n",
    "📌 Now with LoRA, for each layer to be trained, the `d × k` weight update matrix `∆W` is represented by a low-rank decomposition `BA`, where B is a `d × r` matrix and A is a `r × k` matrix. The rank of decomposition `r is << min(d,k)`. The default of r is 8.\n",
    "\n",
    "📌 Now the Matrix `A` is initialized by random Gaussian numbers so the initial weight updates have some variation to start with. And the Matrix `B` is initialized by zero so ∆W is zero at the beginning of training.\n",
    "\n",
    "So the rank `r` determines the \"compactness\" of the approximation.\n",
    "\n",
    "By choosing a small `r`, significantly smaller than both `d` and `k`, the idea is to capture the essence of the weight updates in a much more compact, low-dimensional form.\n",
    "\n",
    "-------\n",
    "\n",
    "📌 Let's say you have a weight matrix of `1000x1000` i.e. 1,000,000 weights.\n",
    "\n",
    "📌 Instead of doing backpropagation and modifying all the 1,000,000 weights, we determine a low rank, let's say it 5.\n",
    "\n",
    "📌 Hence, with LoRA we will be introducing two smaller matrices, A (1000x5) and B (5x1000). \n",
    "\n",
    "`A = matrix[1000x5] B = matrix[5x1000]`\n",
    "\n",
    "📌 And now with LoRA, instead of modifying the original 1,000,000 weights, we will only update the weights in matrices A and B during training, which totals to 10,000 weights (5000 in A and 5000 in B).\n",
    "\n",
    "Which is 0.01% of the initial weights.\n",
    "\n",
    "Despite updating only a tiny fraction of the weights (0.01% in your example), this approach can still capture significant information and lead to effective fine-tuning. This is because the low-rank matrices A and B are able to interact with the entire space of the original weight matrix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------\n",
    "\n",
    "### 📌 Intuitive explanation of the rank of a matrix (in the context of LoRA) ?\n",
    "\n",
    "Think of rows of a Matrix, as points in space. Every row is a point, the numbers giving you the x and y and z (and beyond) coordinates of the point.\n",
    "\n",
    "📌 Do all of these points lie at the origin (0,0,0)? If so, then the rank is 0.\n",
    "\n",
    "📌 Otherwise, do they all lie on a line that passes through the origin? If so, then the rank is 1.\n",
    "\n",
    "📌 Otherwise, do they all lie on a plane that passes through the origin? Well, you guessed it - the rank is 2 in that case.\n",
    "\n",
    "And so on.\n",
    "\n",
    "--------\n",
    "\n",
    "Looking at the image, a 3x3 matrix with two linearly independent rows or columns highlighted - here rank of the matrix is 2.\n",
    "\n",
    "Linear independence is key; for instance, if one row is all zeros, it does not contribute to the rank, and if two rows are multiples of each other, they count as one towards the rank."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](assets/2024-01-08-20-27-48.png)\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
